📝 Objective: Perform the analysis of Metagenome-Assembled Genomes (MAGs) recovered from a public dataset using an end-to-end pipeline.

nf-core/mag

The pipeline used to build the MAGs was nf-core/mag, which integrates several tools to assemble the sequences and polish the recovered MAGs. The usage of the pipeline can be checked here, and below is the workflow it follows:

Because many of the downstream processes demand substantial computational resources, we have pre-computed some of them for you. We will still explain step by step what we did, or point you to references that show what is happening under the hood.

MAGFlow/BIgMAG

MAGFlow combines several tools to measure the quality of the bins/MAGs and to annotate them taxonomically. This is the workflow:

The output of this tool is a ready-to-use file that concatenates the results and can then be used as input for BIgMAG. We will not execute MAGFlow today; instead, you will find the pre-computed final_df.tsv, which is all you need to display the BIgMAG dashboard. Let’s create the environment for BIgMAG by running the following commands in the GitHub Codespace terminal:

Bash

git clone https://github.com/jeffe107/BIgMAG
conda create -n BIgMAG --file BIgMAG/requirements.txt
conda activate BIgMAG

WARN: Whenever you are asked whether to install extra packages, please say yes to all!

Now, we are ready to execute BIgMAG to perform the exploratory analysis of the MAG overall quality and annotation:

Bash

 BIgMAG/app.py -p 8050 data/magflow/final_df.tsv

The terminal will display a link to the dashboard, or the editor will offer to open it in a new browser tab; just click on it.

Now it is your turn to analyze the results. Use these questions to guide your thoughts:

Question: Overall, do the MAGs recovered from your assigned samples have good quality? What would you suggest to improve the bins whose quality is insufficient?

Question: Do the results reported by BUSCO and CheckM2 agree? Why do you think this is happening?

Question: What about the taxonomic classification: what would you report when comparing samples?

Question: Were the samples clustered as you expected? What caused this behavior?

Question: Is there any unusual sample or MAG that catches your attention? What further analysis would you propose to dig into this particular sample or MAG?

WARN: Do not forget to stop the dashboard with Ctrl + c, and to deactivate the environment with:

 conda deactivate

KEGG Decoder

Moving forward, it is now time to identify genomic features within the bins/MAGs. KEGG Decoder is a tool that interprets the metabolic potential of the MAGs (or of any genome in general) by analyzing the presence of KEGG Orthology (KO) identifiers. It maps these KOs to major metabolic pathways and summarizes the completeness of each pathway based on predefined modules. This lets us infer the metabolic capabilities of the community and gain insight into functional differences across samples or MAGs.
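To make the idea of pathway completeness concrete, here is a minimal Python sketch. It assumes a module is just a flat list of required KOs and scores the fraction that is present. The module and the KO lists are placeholder values for illustration, and the curated definitions KEGG Decoder uses internally may be more elaborate than this.

```python
def module_completeness(required_kos, observed_kos):
    """Return the fraction of a module's required KOs observed in a genome."""
    required = set(required_kos)
    if not required:
        return 0.0
    return len(required & set(observed_kos)) / len(required)


# Hypothetical three-step module (placeholder KO identifiers).
toy_module = ["K00001", "K00002", "K00003"]
# KOs annotated in a hypothetical MAG; K09999 is not part of the module.
mag_kos = {"K00001", "K00003", "K09999"}

print(round(module_completeness(toy_module, mag_kos), 2))  # 0.67
```

A value of 1.0 would mean that every step of the toy module is annotated in the MAG, whereas 0.0 would mean that none is.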

Prokka (included within nf-core/mag) has provided us with annotations of the presence and location of enzymes in the bins/MAGs that are involved in a wide variety of processes. However, Prokka describes such genomic features with EC numbers, so we had to convert these annotations to K numbers and merge them into a single file (this was already done for you). Now, let’s install KEGG Decoder:

Bash

conda create -n keggdecoder python=3.6
conda activate keggdecoder
python3 -m pip install KEGGDecoder

WARN: Whenever you are asked whether to install extra packages, please say yes to all!
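For reference, the pre-computed conversion from EC numbers to K numbers mentioned above could be sketched roughly as follows. This is not the script we actually used: the lookup table, the locus tags and the helper name ec_to_ko_rows are hypothetical, and a real lookup table would have to be derived from KEGG. The sketch writes one tab-separated line per gene and KO, with the gene identifier starting with the genome prefix followed by an underscore, which is, to our understanding, the layout KEGG Decoder expects.

```python
def ec_to_ko_rows(annotations, ec_to_ko):
    """Yield (locus_tag, KO) pairs for every KO linked to a gene's EC number."""
    for locus_tag, ec_number in annotations:
        # One EC number can map to several KOs; unknown EC numbers are skipped.
        for ko in ec_to_ko.get(ec_number, []):
            yield locus_tag, ko


# Toy lookup table for illustration only (assumption, not the workshop table).
toy_ec_to_ko = {"1.1.1.1": ["K00001"], "2.7.1.2": ["K00845"]}
# Hypothetical (locus_tag, EC number) pairs as they might be read from Prokka output.
toy_annotations = [("ABCDEFGH_00001", "1.1.1.1"), ("ABCDEFGH_00002", "9.9.9.9")]

for locus_tag, ko in ec_to_ko_rows(toy_annotations, toy_ec_to_ko):
    print(f"{locus_tag}\t{ko}")
```

Genes whose EC number has no entry in the lookup table simply produce no line, so they do not contribute to any pathway.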

Now, we are ready to execute this software to determine the presence and completeness of the annotated metabolic pathways:

Bash

KEGG-decoder --input data/k_numbers/megahit_k_numbers.tsv --output kegg_output.tsv --vizoption static

This command will create a heatmap where you can perform a comparison across MAGs. In this case, we are analyzing the MAGs obtained with the assembler MEGAHIT. It should look like this:

On the y axis, each row is labeled with a seemingly random string of letters: this is the identifier that Prokka has assigned to the genomic features of each MAG. Below you can find the mapping between the name of each bin/MAG and this identifier (listed as ContigName):

Bin ContigName
MEGAHIT-MaxBin2-ERR2143759.001.tsv GIFAEIHF
MEGAHIT-MaxBin2-ERR2143759.002.tsv CIJAKMNO
MEGAHIT-MaxBin2-ERR2143759.003.tsv IACMDEGO
MEGAHIT-MaxBin2-ERR2143759.004.tsv FKGEEGMK
MEGAHIT-MaxBin2-ERR2143759.005.tsv NBLBKFGP
MEGAHIT-MaxBin2-ERR2143760.001.tsv CDMCPOJD
MEGAHIT-MaxBin2-ERR2143760.002.tsv PJEJNNEC
MEGAHIT-MaxBin2-ERR2143760.003.tsv ELBPOCGG
MEGAHIT-MaxBin2-ERR2143760.004.tsv LPDNOCIE
MEGAHIT-MaxBin2-ERR2143771.001.tsv KEBGIGOI
MEGAHIT-MaxBin2-ERR2143771.002.tsv GNGAKHBD
MEGAHIT-MaxBin2-ERR2143772.001.tsv NGILDPFJ
MEGAHIT-MaxBin2-ERR2143772.002.tsv BNBNNLFK
MEGAHIT-MaxBin2-ERR2143773.001.tsv OLADMAGF
MEGAHIT-MaxBin2-ERR2143773.002.tsv EPNPGCJF
MEGAHIT-MaxBin2-ERR2143773.003.tsv OGHEFGBH
MEGAHIT-MetaBAT2-ERR2143759.1.tsv EOMIBOKB
MEGAHIT-MetaBAT2-ERR2143759.2.tsv IMFCMJNE
MEGAHIT-MetaBAT2-ERR2143759.3.tsv JPIOBONM
MEGAHIT-MetaBAT2-ERR2143759.4.tsv IIKMGHIJ
MEGAHIT-MetaBAT2-ERR2143759.5.tsv FJMNHDHH
MEGAHIT-MetaBAT2-ERR2143759.6.tsv HNFOOLAK
MEGAHIT-MetaBAT2-ERR2143759.7.tsv KKNACEHJ
MEGAHIT-MetaBAT2-ERR2143760.1.tsv JMOHPCEH
MEGAHIT-MetaBAT2-ERR2143760.2.tsv JNLKELBO
MEGAHIT-MetaBAT2-ERR2143760.3.tsv BPOICHMO
MEGAHIT-MetaBAT2-ERR2143760.4.tsv OJEIIDGO
MEGAHIT-MetaBAT2-ERR2143760.5.tsv OINBHHLA
MEGAHIT-MetaBAT2-ERR2143771.1.tsv EONPCBDF
MEGAHIT-MetaBAT2-ERR2143771.2.tsv OJCBHCCG
MEGAHIT-MetaBAT2-ERR2143771.3.tsv JADMGMAL
MEGAHIT-MetaBAT2-ERR2143771.4.tsv KKGCJLIG
MEGAHIT-MetaBAT2-ERR2143772.1.tsv KHPHFBJJ
MEGAHIT-MetaBAT2-ERR2143772.2.tsv NECFOIJH
MEGAHIT-MetaBAT2-ERR2143772.3.tsv KDCCMDHF
MEGAHIT-MetaBAT2-ERR2143772.4.tsv FDKDGLCP
MEGAHIT-MetaBAT2-ERR2143773.1.tsv IJLNKNIL
MEGAHIT-MetaBAT2-ERR2143773.2.tsv MACHMHHK
MEGAHIT-MetaBAT2-ERR2143773.3.tsv IGFKCGOH
MEGAHIT-MetaBAT2-ERR2143773.4.tsv LLMAHDOM
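If you want to translate the heatmap labels back to bin names programmatically, a small lookup built from the mapping above is enough. The following is a minimal sketch: the function name is hypothetical, and it assumes the mapping is available as whitespace-separated lines exactly as listed above.

```python
def bin_name_by_prefix(mapping_lines):
    """Build {Prokka identifier: bin name} from 'Bin ContigName' lines."""
    lookup = {}
    for line in mapping_lines:
        parts = line.split()
        # Skip the header and any malformed line.
        if len(parts) != 2 or parts[0] == "Bin":
            continue
        bin_file, prefix = parts
        # Drop the trailing ".tsv" so that only the bin name remains.
        if bin_file.endswith(".tsv"):
            bin_file = bin_file[: -len(".tsv")]
        lookup[prefix] = bin_file
    return lookup


# Two rows copied from the mapping above, plus the header.
mapping_lines = [
    "Bin ContigName",
    "MEGAHIT-MaxBin2-ERR2143759.001.tsv GIFAEIHF",
    "MEGAHIT-MetaBAT2-ERR2143759.7.tsv KKNACEHJ",
]
lookup = bin_name_by_prefix(mapping_lines)
print(lookup["KKNACEHJ"])  # MEGAHIT-MetaBAT2-ERR2143759.7
```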

WARN: Do not forget to deactivate the environment with:

 conda deactivate

COG Annotation

Clusters of Orthologous Genes (COGs) are groups of genes from different organisms that evolved from a common ancestral gene and typically retain the same function. To explore these genes, we rely on the amino-acid sequences of the coding regions provided by Prokka (.faa files). The next task is to install COGclassifier in order to detect Clusters of Orthologous Genes:

Bash

conda create -n cogclassifier -c conda-forge -c bioconda cogclassifier
conda activate cogclassifier

WARN: Whenever you are asked whether to install extra packages, please say yes to all!

This tool automates the whole process, from searching the query sequences against the COG database, through annotation and classification of gene functions, to the generation of publication-ready figures. However, it accepts only one genome at a time. This does not prevent us from analyzing all the bins/MAGs: we can simply run the software on each of them and integrate the data afterwards. To launch the tool, run the following on the terminal:

Bash

COGclassifier -i data/faas/MEGAHIT-MetaBAT2-ERR2143759.7.faa -o cog_annotation --download_dir ./cog_database

Once it has finished, you will find the output inside cog_annotation: tables with counts, summaries and annotations, as well as figures showing the proportion of the different COG categories, just like this:

Question: Which categories are most representative of this MAG? Do you see any odd results?

Question: If you decide to analyze all of the MAGs, do you think it would be a fair comparison just using the raw counts obtained by the software? If not, what strategy would you propose to proceed further?
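If you do decide to process every MAG, the per-genome command shown earlier can be generated in a loop. Here is a minimal Python sketch that only prints the commands. It reuses the -i, -o and --download_dir options from the command above, while the output folder name cog_annotation_all and the function name are assumptions made for illustration.

```python
from pathlib import Path


def cogclassifier_commands(faa_files, out_root, download_dir):
    """Return one COGclassifier command (as an argument list) per .faa file."""
    commands = []
    for faa in sorted(Path(f) for f in faa_files):
        # Each genome gets its own output folder named after the bin.
        out_dir = Path(out_root) / faa.stem
        commands.append([
            "COGclassifier", "-i", str(faa), "-o", str(out_dir),
            "--download_dir", str(download_dir),
        ])
    return commands


faa_dir = Path("data/faas")
faa_files = faa_dir.glob("*.faa") if faa_dir.is_dir() else []
for cmd in cogclassifier_commands(faa_files, "cog_annotation_all", "./cog_database"):
    # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(cmd))
```

Keeping one output folder per bin makes it straightforward to collect the individual count tables afterwards.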

WARN: Do not forget to deactivate the environment with:

 conda deactivate

dbCAN3

CAZymes (Carbohydrate-Active enZymes) are enzymes involved in the breakdown, biosynthesis, or modification of carbohydrates and glycoconjugates. They play a crucial role in processing complex carbohydrates such as cellulose, hemicellulose, starch, and chitin, among others. To detect this kind of enzyme, we are going to use dbCAN3, an automated web server that runs the annotation software and returns the results.

To run the tool, download the file data/faas/MEGAHIT-MetaBAT2-ERR2143759.7.faa from Codespaces, upload it to the server (click here), submit the job and wait for the results.

It should look like this:

GeneID EC HMMER dbCAN_sub DIAMOND Signalp NofTools
OJEIIDGO_00076 None GH2(51-929) None None Y(1-22) 1
OJEIIDGO_00091 None GH20(151-512) None None Y(1-20) 1
OJEIIDGO_00128 None GH57(38-327) None None N 1
OJEIIDGO_00143 None CE1(19-254) None None N 1
OJEIIDGO_00151 None GT35(270-635) None None N 1
OJEIIDGO_00156 None GT41(135-688) None None N 1
OJEIIDGO_00164 None CE1(30-279) None None N 1
OJEIIDGO_00288 None GT83(81-370) None None N 1
OJEIIDGO_00300 None CE20(100-275)+CE20(404-530) None None N 1
OJEIIDGO_00303 None GH16_3(35-227) None None N 1
OJEIIDGO_00310 None GH23(76-208) None None Y(1-22) 1
OJEIIDGO_00341 None GH73(158-287) None None N 1
OJEIIDGO_00345 None GH177(38-426) None None Y(1-35) 1
OJEIIDGO_00393 None GH109(48-198) None None Y(1-30) 1
OJEIIDGO_00403 None GT117(15-234) None None N 1
OJEIIDGO_00418 None AA3(8-564) None None N 1
OJEIIDGO_00421 None GT51(39-219) None None N 1
OJEIIDGO_00582 None AA3(3-568) None None N 1
OJEIIDGO_00608 None GT2(3-113) None None N 1
OJEIIDGO_00632 None GH25(32-203) None None Y(1-17) 1
OJEIIDGO_00633 None CE7(21-301) None None N 1
OJEIIDGO_00690 None GH109(41-201) None None N 1
OJEIIDGO_00711 None GT4(162-301) None None N 1
OJEIIDGO_00726 None GT2(7-137) None None N 1
OJEIIDGO_00790 None GH177(39-417) None None N 1
OJEIIDGO_00879 None GH73(136-324) None None N 1
OJEIIDGO_00903 None GH31_1(1-431) None None N 1
OJEIIDGO_00905 None GH133(51-405) None None N 1
OJEIIDGO_00933 None GH23(80-213) None None N 1
OJEIIDGO_00967 None GT4(185-327) None None N 1
OJEIIDGO_01003 None GH10(530-827) None None Y(1-25) 1
OJEIIDGO_01012 None GH13_16(35-388) None None N 1
OJEIIDGO_01015 None GT2(5-108) None None N 1
OJEIIDGO_01016 None GT2(45-271) None None N 1
OJEIIDGO_01038 None GT4(220-364) None None N 1
OJEIIDGO_01075 None GH51_1(25-513) None None Y(1-30) 1
OJEIIDGO_01100 None CE1(486-698) None None Y(1-21) 1
OJEIIDGO_01133 None GH2(321-692) None None N 1
OJEIIDGO_01214 None GH9(443-830) None None Y(1-23) 1
OJEIIDGO_01247 None GT83(3-459) None None N 1
OJEIIDGO_01282 None GT2(5-134) None None N 1
OJEIIDGO_01283 None GT2(7-175) None None N 1
OJEIIDGO_01317 None CBM13(1045-1182) None None N 1
OJEIIDGO_01321 None GH16_3(163-394) None None Y(1-25) 1
OJEIIDGO_01341 None GT2(46-189) None None N 1
OJEIIDGO_01344 None GT4(195-336) None None N 1
OJEIIDGO_01345 None GT4(195-350) None None N 1
OJEIIDGO_01347 None GT4(199-345) None None N 1
OJEIIDGO_01350 None GH188(391-622) None None N 1
OJEIIDGO_01409 None GT2(5-133) None None N 1
OJEIIDGO_01411 None CE14(44-155) None None Y(1-24) 1
OJEIIDGO_01446 None CBM48(27-107)+GH13_9(166-463) None None N 1
OJEIIDGO_01484 None GH179(12-247) None None N 1
OJEIIDGO_01520 None CE14(6-111) None None N 1
OJEIIDGO_01524 None GT4(194-342) None None N 1
OJEIIDGO_01526 None GT2(73-301) None None N 1
OJEIIDGO_01529 None GT2(6-165) None None N 1
OJEIIDGO_01533 None GT30(45-208) None None N 1
OJEIIDGO_01552 None GT2(2-82) None None N 1
OJEIIDGO_01586 None CBM9(41-212) None None N 1
OJEIIDGO_01612 None GT20(2-460) None None N 1
OJEIIDGO_01621 None PL9_2(325-677) None None Y(1-25) 1
OJEIIDGO_01631 None CBM48(29-114)+GH13_9(184-483) None None N 1
OJEIIDGO_01720 None GH144(34-450) None None Y(1-20) 1
OJEIIDGO_01721 None CBM102(60-174)+CBM102(214-332)+CBM102(366-483)+CBM102(521-644)+CBM102(682-814) None None Y(1-24) 1
OJEIIDGO_01722 None GH144(119-520) None None Y(1-22) 1
OJEIIDGO_01759 None GT83(4-374) None None N 1
OJEIIDGO_01784 None GH140(19-442) None None Y(1-19) 1
OJEIIDGO_01808 None GH171(55-398) None None Y(1-32) 1
OJEIIDGO_01877 None GT2(6-116) None None N 1
OJEIIDGO_01891 None GH88(50-438) None None N 1
OJEIIDGO_01907 None GT9(65-311) None None N 1
OJEIIDGO_01916 None GT5(5-224) None None N 1
OJEIIDGO_01949 None GT4(193-345) None None N 1
OJEIIDGO_02005 None GT2(329-477) None None N 1
OJEIIDGO_02016 None GH74(112-202) None None Y(1-23) 1
OJEIIDGO_02023 None GH179(26-250) None None N 1
OJEIIDGO_02026 None GH177(24-384) None None N 1
OJEIIDGO_02045 None GH18(21-280) None None N 1
OJEIIDGO_02059 None GT51(90-270) None None N 1
OJEIIDGO_02071 None GH3(108-332) None None N 1
OJEIIDGO_02073 None GT55(39-418) None None N 1
OJEIIDGO_02078 None GH73(34-168) None None N 1
OJEIIDGO_02099 None GH179(58-232) None None Y(1-32) 1
OJEIIDGO_02104 None GH109(3-191) None None N 1
OJEIIDGO_02141 None GT10(68-264) None None N 1
OJEIIDGO_02142 None GT8(3-241) None None N 1
OJEIIDGO_02146 None GT4(223-366) None None N 1
OJEIIDGO_02147 None GT2(7-132) None None N 1
OJEIIDGO_02148 None GT2(4-163) None None N 1
OJEIIDGO_02150 None GH43_1(17-328) None None N 1
OJEIIDGO_02206 None CE9(4-255) None None N 1
OJEIIDGO_02230 None CE1(49-268)+CE1(409-635) None None Y(1-20) 1
OJEIIDGO_02231 None GH43_12(57-351)+CBM91(387-573) None None N 1
OJEIIDGO_02246 None GH188(10-204) None None N 1
OJEIIDGO_02276 None PL33(426-582) None None Y(1-22) 1
OJEIIDGO_02278 None PL35(401-574) None None Y(1-26) 1
OJEIIDGO_02307 None GH188(2-192) None None N 1
OJEIIDGO_02319 None GT2(6-172) None None N 1
OJEIIDGO_02326 None GT4(190-338) None None N 1
OJEIIDGO_02336 None GT2(16-177) None None N 1
OJEIIDGO_02382 None GT2(49-160) None None N 1
OJEIIDGO_02415 None GT4(203-346) None None N 1
OJEIIDGO_02445 None GH179(19-342) None None N 1
OJEIIDGO_02450 None GT2(4-148) None None N 1
OJEIIDGO_02458 None GT4(192-342) None None N 1
OJEIIDGO_02469 None GH113(36-340) None None Y(1-23) 1
OJEIIDGO_02492 None GT2(52-274) None None N 1
OJEIIDGO_02512 None PL12(65-203) None None N 1
OJEIIDGO_02519 None CE15(77-445) None None Y(1-21) 1
OJEIIDGO_02581 None GT51(70-247) None None N 1
OJEIIDGO_02606 None GT119(20-203)+GT119(266-469) None None N 1
OJEIIDGO_02704 None GH109(7-164) None None N 1
OJEIIDGO_02725 None GH103(41-335) None None Y(1-31) 1
OJEIIDGO_02748 None GT4(158-302) None None N 1
OJEIIDGO_02758 None GT28(192-351) None None N 1
OJEIIDGO_02759 None GT119(32-381) None None N 1
OJEIIDGO_02858 None CBM91(2-167) None None N 1
OJEIIDGO_02880 None GT2(4-149) None None N 1
OJEIIDGO_02920 None GH13_19(77-423) None None Y(1-25) 1
OJEIIDGO_02940 None GT1(203-432) None None N 1
OJEIIDGO_02953 None GH73(150-279) None None N 1
OJEIIDGO_02972 None CE7(131-405) None None N 1
OJEIIDGO_02985 None CBM48(23-120)+GH13_11(187-536) None None N 1
OJEIIDGO_03073 None GT2(3-150) None None N 1
OJEIIDGO_03111 None CBM98(232-329)+GH13_47(635-956) None None Y(1-29) 1
OJEIIDGO_03180 None GT51(70-223) None None N 1
OJEIIDGO_03210 None GT2(4-166) None None N 1
OJEIIDGO_03211 None GT4(200-347) None None N 1
OJEIIDGO_03246 None GT32(20-96) None None N 1
OJEIIDGO_03249 None GH73(135-265) None None N 1
OJEIIDGO_03255 None CE20(136-372) None None Y(1-28) 1
OJEIIDGO_03260 None GT4(209-361) None None N 1
OJEIIDGO_03273 None CE4(22-133) None None N 1
OJEIIDGO_03279 None GH109(37-445) None None N 1
OJEIIDGO_03291 None GT2(39-200) None None N 1
OJEIIDGO_03348 None CE11(3-223) None None N 1
OJEIIDGO_03363 None CBM9(433-589) None None Y(1-22) 1
OJEIIDGO_03408 None CBM102(1-76)+GH16_3(262-495)+CBM102(613-749) None None N 1
OJEIIDGO_03415 None GT2(8-181) None None N 1
OJEIIDGO_03454 None GH188(4-151) None None N 1
OJEIIDGO_03463 None GT2(102-226) None None N 1
OJEIIDGO_03520 None GH109(42-190) None None Y(1-23) 1
OJEIIDGO_03550 None GH13_48(58-342) None None Y(1-27) 1
OJEIIDGO_03577 None CE13(57-237) None None N 1
OJEIIDGO_03642 None GH19_1(110-196) None None Y(1-23) 1
OJEIIDGO_03696 None GH73(118-246) None None N 1
OJEIIDGO_03780 None GH177(38-379) None None N 1
OJEIIDGO_03785 None GH179(92-259) None None Y(1-30) 1
OJEIIDGO_03790 None GH43_28(19-289)+CBM32(332-445) None None N 1
OJEIIDGO_03830 None GH102(195-303) None None N 1
OJEIIDGO_03853 None GT2(14-137) None None N 1
OJEIIDGO_03854 None GT2(6-124) None None N 1
OJEIIDGO_03855 None GT4(215-363) None None N 1
OJEIIDGO_03889 None GH13_3(219-437) None None N 1
OJEIIDGO_03930 None GT2(4-189) None None N 1
OJEIIDGO_03951 None GT2(9-168) None None N 1
OJEIIDGO_03962 None CE1(34-237) None None Y(1-19) 1

Visit the CAZy website if you want to know more about enzyme families and classes reported by dbCAN3. Similar to the COGclassifier, we can perform the analysis for all the MAGs, and integrate the data afterwards to achieve an overall comparison.
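To compare MAGs, it is often handy to summarize a result table like the one above by CAZy class. Below is a minimal Python sketch that counts the classes (GH, GT, CE, PL, AA, CBM) found in the HMMER column. The function name is hypothetical. It counts domains rather than genes, so a protein with two domains contributes twice.

```python
import re
from collections import Counter

# CBM is matched before any shorter prefix could interfere; digits must follow.
CAZY_CLASS = re.compile(r"(CBM|GH|GT|PL|CE|AA)\d+")


def count_cazy_classes(hmmer_hits):
    """Count CAZy classes across HMMER entries such as 'CBM48(27-107)+GH13_9(166-463)'."""
    counts = Counter()
    for hit in hmmer_hits:
        # Multi-domain proteins list their domains joined by '+'.
        for domain in hit.split("+"):
            match = CAZY_CLASS.match(domain)
            if match:  # entries such as 'None' are ignored
                counts[match.group(1)] += 1
    return counts


# Four HMMER entries copied from the table above.
hits = [
    "GH2(51-929)",
    "CE20(100-275)+CE20(404-530)",
    "GT35(270-635)",
    "CBM48(27-107)+GH13_9(166-463)",
]
counts = count_cazy_classes(hits)
for cazy_class in sorted(counts):
    print(cazy_class, counts[cazy_class])
```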

antiSMASH

antiSMASH (antibiotics & Secondary Metabolite Analysis SHell) is a tool that detects and analyzes biosynthetic gene clusters (BGCs) in microbial genomes. These clusters are groups of co-located genes that together encode the machinery to produce secondary metabolites—specialized compounds that are not essential for basic cellular functions.

Analogous to dbCAN3, we will use the web server to annotate one of the MAGs recovered by the pipeline. First download and extract intermediate.tar.gz from the Moodle page, then upload the file intermediate/gbks/MEGAHIT-MetaBAT2-ERR2143759.7.gbk to antiSMASH and wait for the results.

If the server is slow, we have already performed this step for you; the results are stored in the same intermediate.tar.gz file, at intermediate/antiSMASH/index.html

Proksee

Proksee is an interactive web-based tool for visualizing, annotating, and analyzing prokaryotic genomes. With it we can display genomes as circular or linear maps, annotate them, customize the visualization, and export both high-level and detail-oriented figures.

As with the previous applications, you just need to upload the genome in FASTA or GenBank format. Here we are going to leverage the files produced by Prokka: download and extract intermediate.tar.gz from the Moodle page, then upload intermediate/gbks/MEGAHIT-MetaBAT2-ERR2143759.7.gbk to Proksee.

You can customize the display, add features, and re-annotate the genome, among many other functionalities. Unfortunately, it processes only one genome at a time. From here, it is up to your creativity to make the most of this tool.

You should see something like this example:

Pangenome

For the previous tools we did not analyze just a random MAG: we selected this one because its taxonomic annotation is shared across different samples (see the BIgMAG exercise), which makes it interesting to establish the similarities and differences among these MAGs through a pangenome analysis. In a pangenome analysis we study the entire gene repertoire of related MAGs (same species or genus); for ours, we selected MAGs based on their GTDB classification.

Next, we ran Roary, a tool that determines the core, accessory and unique genome. We pre-computed the results for you following the tutorial provided by the developers of the tool, and now we are going to visualize them using Phandango.
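Before visualizing, a toy example may help to fix the three categories. The Python sketch below uses a deliberately strict rule: core means present in every genome, unique means present in exactly one, and accessory is everything in between. Roary itself reports categories based on frequency thresholds, so treat this as an illustration of the concept rather than a reimplementation. The gene and MAG names are made up.

```python
def classify_genes(presence):
    """Split gene clusters into core, accessory and unique lists.

    presence maps each gene cluster to the set of genomes that carry it.
    """
    genomes = set().union(*presence.values()) if presence else set()
    categories = {"core": [], "accessory": [], "unique": []}
    for gene, carriers in sorted(presence.items()):
        if not carriers:
            continue  # a cluster found in no genome carries no information
        if len(carriers) == len(genomes):
            categories["core"].append(gene)
        elif len(carriers) == 1:
            categories["unique"].append(gene)
        else:
            categories["accessory"].append(gene)
    return categories


# Hypothetical presence/absence data for three MAGs.
toy_presence = {
    "geneA": {"MAG1", "MAG2", "MAG3"},
    "geneB": {"MAG1", "MAG2"},
    "geneC": {"MAG3"},
}
print(classify_genes(toy_presence))
```

Here geneA ends up in the core genome, geneB in the accessory genome, and geneC is unique to MAG3.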

Download and extract intermediate.tar.gz from the Moodle page, then drag and drop the files intermediate/pangenome/JAAUTG01/workshop.newick and intermediate/pangenome/JAAUTG01/gene_presence_absence.csv onto the Phandango web server to visualize the results.

It should look like this:

Usually, this analysis is carried out using a reference genome; however, given that these MAGs are not annotated at species level, and the genus annotation is not informative, we do not have a reference genome with which to explore the pangenome of these MAGs. You can view an example that includes a reference genome using the files found at intermediate/pangenome/example_with_reference.